Abstract

This project used proteomic data from mass spectrometry, aiming to find out how the phosphorylation of specific proteins in signaling pathways can impact cell differentiation under given conditions. After data manipulation, statistical process, and analysis among databases, associations between special proteins and possible pathways with biological activities were established.

Introduction

Professor Carolina Vieira and her colleagues from the University of Brasília want to know what makes the pluripotent stem cells able to return to an undifferentiated state from a differentiated state, under specific conditions. Post-translation modification can be critical in this process. After treating cells under different conditions to control their pace of differentiation, data were generated by mass spectrometry. The dataset containing only the Phosphorylated proteins is referred to as Phospho, and the whole proteomic dataset with no enrichment of phosphorylation is called Proteome. The target of this project is to obtain a list containing the names of the proteins that increase in phosphorylation and interpret the result.

Materials and Methods

Workflow

Non-differentiated C2C12 mouse muscle cells were cultured in growth media with horse serum to induce differentiation. Leukemia inhibitory factor (LIF) was applied to one group of cells to inhibit differentiation, while the other group (Control) was cultured without LIF.

Before running on mass spectrometry, the protein samples were fractionated. By matching the generated data to the FASTA database of mouse proteins, the peptides and proteins were identified and quantified. Furthermore, aliquots of the two samples are enriched for phosphorylated peptides.

To know if one protein has a higher level of phosphorylation in the LIF-treated sample vs. the control sample, we have to consider both datasets. Only when a protein in the Proteome dataset has no difference between LIF-treated and Control, meanwhile the same protein expressed differently between these two conditions, such protein would be considered differentially phosphorylated.

Conditions to define “more phosphorylated”


Codes and Command lines

The codes and command lines used in this project were separated into several blocks with comments and annotation according to the operation steps in the instruction.

1. Read and name datasets

phospho <- read.csv("./Dataset/Dataset_Phospho.csv" , header = T , sep = ";" , stringsAsFactors = F , fileEncoding = "UTF-8-BOM")
proteome <- read.csv("./Dataset/Dataset_Proteome.csv" , header = T , sep = ";" , stringsAsFactors = F , fileEncoding = "UTF-8-BOM")

The two .csv files were read into two data frames, phospho and proteome respectively.

fileEncoding=“UTF-8-BOM” is to remove “ï..” of the first column.


2. Calculations (for the Phospho dataset)

a. Calculate the average of Abundance columns

phospho$Abundance.Control.Average <- rowMeans(phospho[ ,c("Abundance.Control.1" , "Abundance.Control.2" , "Abundance.Control.3")] , na.rm = TRUE)
phospho$Abundance.LIF.Average <- rowMeans(phospho[ ,c("Abundance.LIF.1" , "Abundance.LIF.2" , "Abundance.LIF.3")] , na.rm = TRUE)

Two new columns in phospho, “Abundance.Control.Average” and “Abundance.LIF.Average”, were appended. The arithmetic mean of LIF and Control group for each row were calculated by the function “rowMeans”. na.rm=TRUE to omit the row with “NA” when calculating.


b. Calculate the ratio (LIF/Control)

phospho$ratio <- phospho$Abundance.LIF.Average / phospho$Abundance.Control.Average

A new “ratio” column was appended. The value of each row is the ratio of the mean of LIF and Control.


c. Calculate the p-values for the ratios (LIF/Control)

temp <- phospho[ ,c(10:15)]
temp <- na.omit(temp)

Create a temporary data frame for following Student’s t test. This temporary data frame contains only Abundance Control 1-3 and Abundance LIF 1-3 without NA. To run the t test properly, NA was removed ahead of schedule in order, which should be done in step 3a.


LIF <- grep("LIF" , colnames(temp) , value = TRUE)
Control <- grep("Control", colnames(temp) , value = TRUE)

p <- apply(temp ,
           1 ,
           function(x)
             t.test(x[LIF] ,
                    x[Control])$p.value)

phospho_no_na <- na.omit(phospho)
phospho_no_na$pvalue <- p

Create two vectors containing the column headers including “LIF” and “Control” respectively. Function “apply” was used to apply function “t.test” on each row of the temporary data frame. Function “t.test” was performed between the “LIF” group and the “Control” group according to the temporary vectors created previously. After the t test, the p.value was stored in a vector. This vector was appended to a new phospho data frame without NA as a column.


3. Cleaning and filtering the dataset

a. Remove peptides without quantitative values (all Abundance rows are empty)

Empty cells and NAs have already been removed in previous step.


b. Create a sub-dataset containing the peptides with ratio ≥ 2.0 from the Phospho dataset, name it “More_Phosphorylated”.

library(dplyr)

More_Phosphorylated <- filter(phospho_no_na , ratio >= 2)

Package “dplyr” was loaded to use its function “filter”. Function “filter” was used to filter in the rows have value greater than 2 in ratio column from phospho_no_na to A new More_Phosphorylated data frame.


4. Protein Interaction Network

a. Download the STRINGdb package and run it in RStudio.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("STRINGdb")

Download and install the “STRINGdb” package if it doesn’t exist in the current system


library(STRINGdb)

Run the “STRINGdb” package after installation


b. Use the string_db$plot_network(yourdata) function.

string_db <- STRINGdb$new(version = "11" , species = 10090)

Create an object link to the STRING database for the following search.

This searching will use stringdb version 11 on Mus musculus (Mouse, Taxon identifier: 10090)

An optional arguments, score_threshold=400 are omitted to set as default for first attempt.


mapped <- string_db$map(More_Phosphorylated , "Master.Protein.Accessions" , removeUnmappedRows = TRUE )

Map(Search) the entries in “Master.Protein.Accessions” column of “More_Phosphorylated” data frame in the STRING database.

A column “STRING_id” appended.

The “removeUnmappedRows” set as True to remove the rows that cannot be mapped to STRING. By default, the argument was FALSE, those unmapped lines are left and their STRING_id is set to NA.


hits <- mapped$STRING_id

Create a vector called “hits”, which contains the mapped STRING_id.


string_db$plot_network( hits )

According to the search result, an interaction plot of the listed protein was generated.

Protein interaction plot of “More_Phosphorylated”


Legend of protein interaction plots

c. Use the sub-dataset “More_Phosphorylated” of highly phosphorylated peptides as an input into the STRING package to obtain a network map of the proteins.

proteome_gt_005 <- filter(proteome , Abundance.Ratio.P.Value...LIF.....Control. > 0.05)

proteome_to_match <- proteome_gt_005
phospho_to_match <- More_Phosphorylated

names(phospho_to_match)[1] <- "Accession"

The entries in the proteome dataset with p.value greater than 0.05 were filtered in a new data frame “proteome_gt_005”

Create 2 data frames containing the existing data to avoid confounding, since the following step is to rename the column name to making them identical between two data frames, make it available for the next step.


MORE_More_Phosphorylated <- inner_join(proteome_to_match , phospho_to_match , by = "Accession")


mapped2 <- string_db$map( MORE_More_Phosphorylated , "Accession" , removeUnmappedRows = TRUE )
hits2 <- mapped2$STRING_id
string_db$plot_network( hits2 )

Function “inner_join” from package “dplyr” was used to merge the two data frames by column “Accession”. As the illustration below, this function will keep the elements from both data frames.

How inner_join works


After the merging step. The similar commands to generate protein interaction plots were run again for a narrowed list.

MORE_More_Phosphorylated


What are the proteins that interact with Stat5b?

Stat5b has direct interaction with Ptpn11, Numa1, and Stat5a when searching in a loose setting.

string_db2 <- STRINGdb$new(version = "11" , species = 10090 , score_threshold = 700)
mapped3 <- string_db2$map(MORE_More_Phosphorylated , "Accession" , removeUnmappedRows = TRUE )
hits3 <- mapped3$STRING_id
string_db2$plot_network( hits3 )

However, Only Stat5a, Ptpn11, Gab1, Sos1 retain strong interaction with Stat5b under stringent condition(score_threshold=700).

Five interaction proteins with strong confidence


d. Alternatively you can input the accessions of the “More_Phosphorylated” list here.

write.table(More_Phosphorylated$Master.Protein.Accessions , file = "More_Phosphorylated.txt" , quote = F, row.names = F, col.names = F)

write.table(MORE_More_Phosphorylated$Accession , file = "MORE_More_Phosphorylated(after_match).txt" , quote = F , row.names = F , col.names = F)

An alternative query can be done in STRING


e. Obtain the Network of protein-interactions image, add it to your answers for the questions below.

Protein-interactions images under different settings can be found above.

Results

Functional enrichment analysis (also known as Gene set enrichment analysis, GSEA) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with disease phenotypes. There are many online tools to perform Functional Enrichment Analysis or similar analysis, such as KEGG and GENE ONTOLOGY. Also, Enrichr, DAVID, and NetworkAnalyst have developed websites to integrate different databases allowing users to generate and compare the result in several ways. The STRING protein interaction database we used previously also provides a crosslink function to GENE ONTOLOGY and KEGG to do functional enrichments in the result page when you do the search on webpage.

Some inferences that can be made after running the filtered out gene list on different databases. Chronic myeloid leukemia(mmu05220), ErbB signaling pathway(mmu04012), and Jak-STAT signaling pathway(mmu04630) turn to be the 3 most enriched pathways in KEGG 2019 MOUSE database. This result coincides with the STRINGdb protein interaction result.

Enrichr KEGG_2019_Mouse

The 5 proteins (Stat5b, Stat5a, Ptpn11, Gab1, Sos1) retain interaction under stringent settings (score_threshold=700) play exact important roles in these 3 pathways. Ptpn11 is in Chronic myeloid leukemia and Gab1 is in ErbB signaling pathways, while the other 3 gene involves both CML and ErbB signaling pathway.

Enrichr clustergram

In GENE ONTOLOGY database, these 5 genes were analyzed in 3 categories. In Biological Process(BP), the 5 genes are enriched in Positive regulation of mitotic cell cycle(GO:0045931), Regulation of multicellular organism growth(GO:0040014), and Epidermal growth factor receptor signaling pathway(GO:0007173). Epidermal growth factor receptor (EGFR) is also known as Erbb or Erbb1. In mice, binding of the ligand of EGFR triggers homo- and/or heterodimerization of the receptor triggering its autophosphorylation, heterodimer with ERBB2. In Molecular Function(MF), these genes involve protein binding(GO:0005515) and signal transducer activity(GO:0004871) These 2 terms show with FDR around 0.1 even they are top 2 hits in this category. In Cell Component(CC), they all exist in cytosol(GO:0005829)

Category Term Description Genes PValue FDR
GO_BP_DIRECT GO:0045931 positive regulation of mitotic cell cycle STAT5A, STAT5B, PTPN11 0.0000182 0.0013764
GO_BP_DIRECT GO:0040014 regulation of multicellular organism growth STAT5A, STAT5B, PTPN11 0.0000257 0.0013764
GO_BP_DIRECT GO:0007173 epidermal growth factor receptor signaling pathway GAB1, PTPN11, SOS1 0.0000362 0.0013764
GO_MF_DIRECT GO:0005515 protein binding STAT5A, STAT5B, GAB1, PTPN11, SOS1 0.0030232 0.0876737
GO_MF_DIRECT GO:0004871 signal transducer activity STAT5A, STAT5B, GAB1 0.0078629 0.1140126
GO_CC_DIRECT GO:0005829 cytosol STAT5A, STAT5B, PTPN11, SOS1 0.0027806 0.0361478
KEGG_PATHWAY mmu05220 Chronic myeloid leukemia STAT5A, STAT5B, PTPN11, SOS1 0.0000031 0.0000943
KEGG_PATHWAY mmu04012 ErbB signaling pathway STAT5A, STAT5B, GAB1, SOS1 0.0000056 0.0000943
KEGG_PATHWAY mmu04630 Jak-STAT signaling pathway STAT5A, STAT5B, PTPN11, SOS1 0.0000259 0.0002940

Discussion

Five out of 60 proteins show significant interaction after filter

Amongst the 5428 entries of phospho dataset and 6146 entries of proteome dataset, only 68 entries were retained after the procedure of data cleaning, calculation, filter, and matching. Eight duplications were removed when performing STRING. One protein may break down into different peptides before loading to mass spectrometry, which can be the source of duplications. According to the result (plot and data) from STRING analysis, 5 out of 60 proteins, Stat5b, Stat5a, Ptpn11, Gab1, Sos1, were regarded as key roles.

Questions in Instruction

1) How many proteins, in the Proteome dataset had ratios statistically significantly different?

dim(filter(proteome , Abundance.Ratio.P.Value...LIF.....Control. < 0.05))[1]

There are 294 entries with p-value less than 0.05 in proteome dataset.

All adjusted p-value in proteome dataset is greater than 0.85.


2) What quality parameters did you use in your analysis?

For STRINGdb search, 2 different score threshold were applied, 400 for default and 700 for stringent respectively.

The statistical significance threshold(α) for testing in this project was set as 0.05.


3) How many peptides in the Phospho dataset where more phosphorylated?

In the phospho dataset, there are 87 peptides (79 if remove 8 duplications) were listed in more phosphorylated in step 3b, by filtering the ratio greater than 2.

If match the filter result with the proteome dataset as the criteria mentioned in the introduction section, 68 (60 if remove 8 duplications) should be defined as more phosphorylated for both two datasets.


4) When you analyze the proteins in STRING or GO, what are the most enriched pathways?

Chronic myeloid leukemia(mmu05220) and Jak-SATA signaling pathway(mmu04630) include STAT5A, STAT5B, PTPN11, SOS1 and ErbB signaling pathway(mmu04012) includes STAT5A, STAT5B, GAB1, SOS1.


5) What were the conclusions of Professor Carolina Vieira after seeing the results? (mandatory)

Stat5b, Stat5a, Ptpn11, Gab1, Sos1, these 5 proteins involved in critical biological activities with low false discovery rate in enrichment analysis. It implies they may play important roles in myeloid malignancy development and phosphorylation of ErbB signaling pathway. Details can be found in the Result and Discussion sections above.